Video Fill in the Blank with Merging LSTMs

نویسندگان

  • Amir Mazaheri
  • Dong Zhang
  • Mubarak Shah
چکیده

Given a video and its incomplete textural description with missing words, the Video-Fill-in-the-Blank (ViFitB) task is to automatically find the missing word. The contextual information of the sentences are important to infer the missing words; the visual cues are even more crucial to get a more accurate inference. In this paper, we presents a new method which intuitively takes advantage of the structure of the sentences and employs merging LSTMs (to merge two LSTMs) to tackle the problem with embedded textural and visual cues. In the experiments, we have demonstrated the superior performance of the proposed method on the challenging “Movie Fill-in-the-Blank” dataset [5].

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Test Method Facet and the Construct Validity of Listening Comprehension Tests

The assessment of listening abilities is one of the least understood, least developed and, yet, one of the most important areas of language testing and assessment. It is particularly important because of its potential wash-back effects on classroom practices. Given the fact that listening tests play a great role in assessing the language proficiency of students, they are expected to enjoy a hig...

متن کامل

R-Tree for phase change memory

Nowadays, many applications use spatial data for instance-location information, so storing spatial data is important. We suggest using R -Tree over PCM. Our objective is to design a PCM-sensitive R -Tree that can store spatial data as well as improve the endurance problem. Initially, we examine how R -Tree causes endurance problems in PCM, and we then optimize it for PCM. We propose doubling th...

متن کامل

Uncovering Temporal Context for Video Question and Answering

In this work, we introduce Video Question Answering in temporal domain to infer the past, describe the present and predict the future. We present an encoder-decoder approach using Recurrent Neural Networks to learn temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using ques...

متن کامل

Recognition of Visual Events using Spatio-Temporal Information of the Video Signal

Recognition of visual events as a video analysis task has become popular in machine learning community. While the traditional approaches for detection of video events have been used for a long time, the recently evolved deep learning based methods have revolutionized this area. They have enabled event recognition systems to achieve detection rates which were not reachable by traditional approac...

متن کامل

Unsupervised Learning of Video Representations using LSTMs

We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kind...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1610.04062  شماره 

صفحات  -

تاریخ انتشار 2016